Remove `apache_beam` import in `BeamBasedBuilder._save_info` #6265

mariosasko · 2023-09-27T13:56:34Z

... to avoid an `ImportError` raised in `BeamBasedBuilder._save_info` when `apache_beam` is not installed (e.g., when downloading the processed version of a dataset from the HF GCS).

Fix #6260
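A minimal sketch of the general idea, not the code changed in this PR: a metadata-saving step that relies only on the standard library keeps working when an optional dependency is missing, while the step that really needs the dependency imports it lazily. The names `save_info` and `run_pipeline` and the file name used here are hypothetical and chosen for illustration only.

```python
import importlib
import json
import os
import tempfile


def save_info(info: dict, output_dir: str) -> str:
    """Write metadata using the standard library only (no optional imports)."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "dataset_info.json")  # illustrative file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump(info, f)
    return path


def run_pipeline(module_name: str = "apache_beam"):
    """Import the optional dependency only in the code path that needs it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(f"{module_name} is required to run the pipeline") from err


# Saving metadata succeeds whether or not the optional package is installed.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_info({"name": "demo"}, tmp)
    with open(saved_path, encoding="utf-8") as f:
        loaded = json.load(f)
```

With this layout, only callers of `run_pipeline` see an `ImportError` when the optional package is absent.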

HuggingFaceDocBuilderDev · 2023-09-27T14:02:21Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-09-27T15:22:47Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005896 / 0.011353 (-0.005457)	0.003642 / 0.011008 (-0.007366)	0.081917 / 0.038508 (0.043409)	0.059513 / 0.023109 (0.036404)	0.341422 / 0.275898 (0.065524)	0.359278 / 0.323480 (0.035798)	0.004707 / 0.007986 (-0.003279)	0.002938 / 0.004328 (-0.001390)	0.063095 / 0.004250 (0.058845)	0.051777 / 0.037052 (0.014725)	0.321114 / 0.258489 (0.062625)	0.363823 / 0.293841 (0.069982)	0.027590 / 0.128546 (-0.100957)	0.007846 / 0.075646 (-0.067800)	0.261197 / 0.419271 (-0.158074)	0.045812 / 0.043533 (0.002279)	0.319787 / 0.255139 (0.064648)	0.341839 / 0.283200 (0.058640)	0.021913 / 0.141683 (-0.119770)	1.397525 / 1.452155 (-0.054630)	1.495902 / 1.492716 (0.003186)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.224815 / 0.018006 (0.206809)	0.425780 / 0.000490 (0.425290)	0.006934 / 0.000200 (0.006734)	0.000225 / 0.000054 (0.000171)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.024342 / 0.037411 (-0.013070)	0.073923 / 0.014526 (0.059398)	0.082108 / 0.176557 (-0.094448)	0.143017 / 0.737135 (-0.594119)	0.083163 / 0.296338 (-0.213175)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.398244 / 0.215209 (0.183035)	3.957688 / 2.077655 (1.880033)	1.904615 / 1.504120 (0.400495)	1.710353 / 1.541195 (0.169158)	1.798980 / 1.468490 (0.330490)	0.499307 / 4.584777 (-4.085470)	3.026734 / 3.745712 (-0.718978)	2.923940 / 5.269862 (-2.345922)	1.831870 / 4.565676 (-2.733807)	0.058551 / 0.424275 (-0.365724)	0.006403 / 0.007607 (-0.001204)	0.464164 / 0.226044 (0.238119)	4.644556 / 2.268929 (2.375628)	2.341455 / 55.444624 (-53.103169)	2.004385 / 6.876477 (-4.872092)	2.051819 / 2.142072 (-0.090253)	0.585610 / 4.805227 (-4.219617)	0.124735 / 6.500664 (-6.375929)	0.061150 / 0.075469 (-0.014319)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.224665 / 1.841788 (-0.617122)	17.476227 / 8.074308 (9.401919)	13.867617 / 10.191392 (3.676225)	0.144177 / 0.680424 (-0.536247)	0.017045 / 0.534201 (-0.517156)	0.337468 / 0.579283 (-0.241815)	0.374476 / 0.434364 (-0.059888)	0.393428 / 0.540337 (-0.146910)	0.535335 / 1.386936 (-0.851601)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006208 / 0.011353 (-0.005145)	0.003650 / 0.011008 (-0.007359)	0.062843 / 0.038508 (0.024335)	0.062272 / 0.023109 (0.039162)	0.446336 / 0.275898 (0.170438)	0.477476 / 0.323480 (0.153996)	0.004862 / 0.007986 (-0.003124)	0.002822 / 0.004328 (-0.001506)	0.063427 / 0.004250 (0.059177)	0.049023 / 0.037052 (0.011971)	0.453633 / 0.258489 (0.195144)	0.486494 / 0.293841 (0.192653)	0.028634 / 0.128546 (-0.099912)	0.008187 / 0.075646 (-0.067460)	0.068846 / 0.419271 (-0.350425)	0.041104 / 0.043533 (-0.002429)	0.446646 / 0.255139 (0.191507)	0.468860 / 0.283200 (0.185660)	0.020980 / 0.141683 (-0.120703)	1.455565 / 1.452155 (0.003410)	1.511142 / 1.492716 (0.018426)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.224242 / 0.018006 (0.206236)	0.408483 / 0.000490 (0.407993)	0.003495 / 0.000200 (0.003296)	0.000076 / 0.000054 (0.000022)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027286 / 0.037411 (-0.010125)	0.081151 / 0.014526 (0.066625)	0.096598 / 0.176557 (-0.079959)	0.146193 / 0.737135 (-0.590942)	0.092213 / 0.296338 (-0.204125)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.463837 / 0.215209 (0.248628)	4.636820 / 2.077655 (2.559165)	2.576100 / 1.504120 (1.071980)	2.396974 / 1.541195 (0.855779)	2.461526 / 1.468490 (0.993036)	0.502360 / 4.584777 (-4.082417)	3.099973 / 3.745712 (-0.645739)	2.937260 / 5.269862 (-2.332602)	1.871274 / 4.565676 (-2.694402)	0.057913 / 0.424275 (-0.366362)	0.006511 / 0.007607 (-0.001096)	0.536917 / 0.226044 (0.310873)	5.396966 / 2.268929 (3.128038)	3.015646 / 55.444624 (-52.428978)	2.673793 / 6.876477 (-4.202684)	2.712376 / 2.142072 (0.570304)	0.591632 / 4.805227 (-4.213595)	0.124872 / 6.500664 (-6.375792)	0.061820 / 0.075469 (-0.013649)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.356828 / 1.841788 (-0.484960)	18.076995 / 8.074308 (10.002687)	15.116482 / 10.191392 (4.925090)	0.151375 / 0.680424 (-0.529049)	0.017867 / 0.534201 (-0.516334)	0.335012 / 0.579283 (-0.244271)	0.384137 / 0.434364 (-0.050226)	0.397792 / 0.540337 (-0.142546)	0.551521 / 1.386936 (-0.835415)
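Each cell in the tables above has the form `new / old (diff)`, and in these rows the parenthesized value matches `new - old` (for example, 0.005896 - 0.011353 = -0.005457). The following is a small sketch for parsing a cell and checking that relationship; `parse_cell` is a hypothetical helper, not part of the benchmark tooling.

```python
import re
from typing import Tuple

# Matches a cell such as "0.005896 / 0.011353 (-0.005457)".
CELL = re.compile(r"^\s*(-?[\d.]+) / (-?[\d.]+) \((-?[\d.]+)\)\s*$")


def parse_cell(cell: str) -> Tuple[float, float, float]:
    """Split a 'new / old (diff)' cell into three floats."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


new, old, diff = parse_cell("0.005896 / 0.011353 (-0.005457)")
# The reported diff should equal new - old up to rounding.
is_consistent = abs((new - old) - diff) < 1e-6
```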

github-actions · 2023-09-28T15:53:37Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009418 / 0.011353 (-0.001935)	0.005186 / 0.011008 (-0.005822)	0.112270 / 0.038508 (0.073761)	0.114856 / 0.023109 (0.091747)	0.402267 / 0.275898 (0.126369)	0.445213 / 0.323480 (0.121733)	0.005588 / 0.007986 (-0.002398)	0.004315 / 0.004328 (-0.000013)	0.083561 / 0.004250 (0.079311)	0.087319 / 0.037052 (0.050267)	0.400989 / 0.258489 (0.142500)	0.455636 / 0.293841 (0.161795)	0.045168 / 0.128546 (-0.083378)	0.010939 / 0.075646 (-0.064707)	0.400120 / 0.419271 (-0.019151)	0.071599 / 0.043533 (0.028066)	0.418112 / 0.255139 (0.162973)	0.443889 / 0.283200 (0.160690)	0.032433 / 0.141683 (-0.109250)	1.886313 / 1.452155 (0.434159)	2.012909 / 1.492716 (0.520193)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.306991 / 0.018006 (0.288985)	0.590426 / 0.000490 (0.589937)	0.011811 / 0.000200 (0.011611)	0.000596 / 0.000054 (0.000542)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.042520 / 0.037411 (0.005108)	0.129808 / 0.014526 (0.115283)	0.125481 / 0.176557 (-0.051075)	0.199181 / 0.737135 (-0.537954)	0.130426 / 0.296338 (-0.165913)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.526455 / 0.215209 (0.311246)	5.213304 / 2.077655 (3.135649)	2.643406 / 1.504120 (1.139286)	2.611214 / 1.541195 (1.070019)	2.586730 / 1.468490 (1.118240)	0.639103 / 4.584777 (-3.945674)	5.197421 / 3.745712 (1.451709)	4.634642 / 5.269862 (-0.635220)	2.741079 / 4.565676 (-1.824598)	0.073064 / 0.424275 (-0.351211)	0.009441 / 0.007607 (0.001834)	0.635984 / 0.226044 (0.409940)	6.283268 / 2.268929 (4.014339)	3.337205 / 55.444624 (-52.107419)	3.192362 / 6.876477 (-3.684114)	2.910367 / 2.142072 (0.768294)	0.767937 / 4.805227 (-4.037290)	0.177467 / 6.500664 (-6.323198)	0.081162 / 0.075469 (0.005693)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.803717 / 1.841788 (-0.038071)	26.823235 / 8.074308 (18.748927)	19.714471 / 10.191392 (9.523079)	0.204048 / 0.680424 (-0.476376)	0.025992 / 0.534201 (-0.508209)	0.521438 / 0.579283 (-0.057845)	0.596524 / 0.434364 (0.162160)	0.600763 / 0.540337 (0.060425)	0.945971 / 1.386936 (-0.440965)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009126 / 0.011353 (-0.002226)	0.005109 / 0.011008 (-0.005899)	0.083046 / 0.038508 (0.044538)	0.115930 / 0.023109 (0.092821)	0.534311 / 0.275898 (0.258413)	0.552846 / 0.323480 (0.229366)	0.007240 / 0.007986 (-0.000746)	0.004617 / 0.004328 (0.000289)	0.083927 / 0.004250 (0.079676)	0.075926 / 0.037052 (0.038873)	0.534750 / 0.258489 (0.276261)	0.575122 / 0.293841 (0.281281)	0.041001 / 0.128546 (-0.087545)	0.010851 / 0.075646 (-0.064795)	0.096574 / 0.419271 (-0.322697)	0.063533 / 0.043533 (0.020001)	0.546850 / 0.255139 (0.291711)	0.547122 / 0.283200 (0.263922)	0.032437 / 0.141683 (-0.109245)	1.926191 / 1.452155 (0.474036)	2.029841 / 1.492716 (0.537125)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.275582 / 0.018006 (0.257576)	0.574212 / 0.000490 (0.573722)	0.006863 / 0.000200 (0.006663)	0.000236 / 0.000054 (0.000181)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.045340 / 0.037411 (0.007928)	0.129196 / 0.014526 (0.114670)	0.136637 / 0.176557 (-0.039920)	0.200040 / 0.737135 (-0.537096)	0.136328 / 0.296338 (-0.160011)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.612379 / 0.215209 (0.397170)	5.874664 / 2.077655 (3.797010)	3.070626 / 1.504120 (1.566506)	2.999319 / 1.541195 (1.458124)	3.000571 / 1.468490 (1.532081)	0.732119 / 4.584777 (-3.852658)	5.193226 / 3.745712 (1.447514)	4.714571 / 5.269862 (-0.555291)	2.870438 / 4.565676 (-1.695239)	0.075793 / 0.424275 (-0.348482)	0.009238 / 0.007607 (0.001631)	0.695192 / 0.226044 (0.469148)	6.897996 / 2.268929 (4.629067)	3.923474 / 55.444624 (-51.521150)	3.458326 / 6.876477 (-3.418151)	3.331652 / 2.142072 (1.189579)	0.821132 / 4.805227 (-3.984095)	0.182252 / 6.500664 (-6.318412)	0.084730 / 0.075469 (0.009260)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.919861 / 1.841788 (0.078073)	27.437228 / 8.074308 (19.362920)	21.109899 / 10.191392 (10.918507)	0.245998 / 0.680424 (-0.434426)	0.025817 / 0.534201 (-0.508384)	0.517757 / 0.579283 (-0.061526)	0.576375 / 0.434364 (0.142011)	0.625283 / 0.540337 (0.084945)	0.956877 / 1.386936 (-0.430059)

lhoestq

Cool! It should get the same credentials as the Beam filesystem, via environment variables or credentials files on disk, so we're fine.

github-actions · 2023-09-28T18:34:02Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008099 / 0.011353 (-0.003254)	0.004815 / 0.011008 (-0.006194)	0.099657 / 0.038508 (0.061149)	0.064737 / 0.023109 (0.041628)	0.461773 / 0.275898 (0.185875)	0.444810 / 0.323480 (0.121330)	0.004247 / 0.007986 (-0.003739)	0.004956 / 0.004328 (0.000628)	0.068664 / 0.004250 (0.064414)	0.052039 / 0.037052 (0.014986)	0.406750 / 0.258489 (0.148261)	0.452832 / 0.293841 (0.158991)	0.044518 / 0.128546 (-0.084028)	0.013220 / 0.075646 (-0.062426)	0.317713 / 0.419271 (-0.101558)	0.061897 / 0.043533 (0.018364)	0.398664 / 0.255139 (0.143525)	0.531494 / 0.283200 (0.248294)	0.064033 / 0.141683 (-0.077650)	1.590385 / 1.452155 (0.138231)	1.769918 / 1.492716 (0.277202)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.230795 / 0.018006 (0.212789)	0.568797 / 0.000490 (0.568308)	0.013498 / 0.000200 (0.013298)	0.000448 / 0.000054 (0.000393)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028394 / 0.037411 (-0.009017)	0.081973 / 0.014526 (0.067447)	0.097623 / 0.176557 (-0.078934)	0.158691 / 0.737135 (-0.578445)	0.101548 / 0.296338 (-0.194791)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.574459 / 0.215209 (0.359249)	5.709871 / 2.077655 (3.632217)	2.521460 / 1.504120 (1.017340)	2.239463 / 1.541195 (0.698268)	2.195067 / 1.468490 (0.726577)	0.792390 / 4.584777 (-3.792387)	4.841665 / 3.745712 (1.095952)	4.201620 / 5.269862 (-1.068241)	2.664081 / 4.565676 (-1.901595)	0.097661 / 0.424275 (-0.326614)	0.008428 / 0.007607 (0.000821)	0.698729 / 0.226044 (0.472684)	6.908867 / 2.268929 (4.639939)	3.247480 / 55.444624 (-52.197145)	2.563921 / 6.876477 (-4.312556)	2.738249 / 2.142072 (0.596177)	0.972066 / 4.805227 (-3.833161)	0.191196 / 6.500664 (-6.309468)	0.064732 / 0.075469 (-0.010737)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.421910 / 1.841788 (-0.419877)	20.633538 / 8.074308 (12.559230)	18.054562 / 10.191392 (7.863170)	0.194125 / 0.680424 (-0.486299)	0.028097 / 0.534201 (-0.506104)	0.417857 / 0.579283 (-0.161426)	0.518758 / 0.434364 (0.084394)	0.500199 / 0.540337 (-0.040138)	0.754662 / 1.386936 (-0.632274)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008452 / 0.011353 (-0.002901)	0.004646 / 0.011008 (-0.006362)	0.077286 / 0.038508 (0.038778)	0.072507 / 0.023109 (0.049398)	0.439580 / 0.275898 (0.163682)	0.506166 / 0.323480 (0.182686)	0.006035 / 0.007986 (-0.001950)	0.003886 / 0.004328 (-0.000442)	0.075091 / 0.004250 (0.070841)	0.063163 / 0.037052 (0.026110)	0.468550 / 0.258489 (0.210061)	0.523273 / 0.293841 (0.229432)	0.048728 / 0.128546 (-0.079818)	0.012991 / 0.075646 (-0.062655)	0.087964 / 0.419271 (-0.331308)	0.058920 / 0.043533 (0.015387)	0.451247 / 0.255139 (0.196108)	0.489827 / 0.283200 (0.206628)	0.031164 / 0.141683 (-0.110519)	1.675504 / 1.452155 (0.223349)	1.806098 / 1.492716 (0.313382)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.253567 / 0.018006 (0.235561)	0.508971 / 0.000490 (0.508481)	0.010882 / 0.000200 (0.010682)	0.000111 / 0.000054 (0.000057)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029490 / 0.037411 (-0.007921)	0.090255 / 0.014526 (0.075729)	0.110075 / 0.176557 (-0.066482)	0.159375 / 0.737135 (-0.577760)	0.109313 / 0.296338 (-0.187025)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.580252 / 0.215209 (0.365043)	5.911741 / 2.077655 (3.834086)	2.659405 / 1.504120 (1.155285)	2.344943 / 1.541195 (0.803749)	2.390748 / 1.468490 (0.922258)	0.827823 / 4.584777 (-3.756954)	4.973544 / 3.745712 (1.227832)	4.300220 / 5.269862 (-0.969642)	2.826181 / 4.565676 (-1.739495)	0.101013 / 0.424275 (-0.323263)	0.008025 / 0.007607 (0.000418)	0.728414 / 0.226044 (0.502369)	7.508045 / 2.268929 (5.239117)	3.687627 / 55.444624 (-51.756997)	2.902953 / 6.876477 (-3.973524)	3.094624 / 2.142072 (0.952551)	1.054696 / 4.805227 (-3.750531)	0.212297 / 6.500664 (-6.288367)	0.070211 / 0.075469 (-0.005258)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.567117 / 1.841788 (-0.274670)	21.420746 / 8.074308 (13.346438)	19.857467 / 10.191392 (9.666075)	0.228554 / 0.680424 (-0.451870)	0.032278 / 0.534201 (-0.501923)	0.459966 / 0.579283 (-0.119317)	0.541219 / 0.434364 (0.106855)	0.549599 / 0.540337 (0.009261)	0.731476 / 1.386936 (-0.655460)

mariosasko added 2 commits September 27, 2023 15:45

Remove apache_beam import in save_info in BeamBasedBuilder

d64e624

Merge branch 'main' of github.com:huggingface/datasets into remove-beam-import

a375fde

mariosasko added 2 commits September 27, 2023 17:07

Oops :)

ddd9ba4

Style

46a0506

Merge branch 'main' into remove-beam-import

8ddee15

lhoestq approved these changes Sep 28, 2023

mariosasko marked this pull request as ready for review September 28, 2023 18:23

mariosasko merged commit 0cc77d7 into main Sep 28, 2023
9 of 13 checks passed

mariosasko deleted the remove-beam-import branch September 28, 2023 18:23
